## library(devtools)
## library(BiocManager)
library(SingleCellExperiment)
library(ggplot2)
library(scFeatures) ## devtools::install_github("SydneyBioX/scFeatures")
library(ClassifyR) ## BiocManager::install("ClassifyR", dependencies = TRUE)
library(ggthemes)
library(spicyR) ## BiocManager::install("spicyR")
library(dplyr)
library(limma)
library(plotly)
library(scattermore)
library(tidyr)
library(survival)
library(survminer)
library(spatstat)
library(lisaClust)
##library(spatstat.core) ## install.packages("spatstat")
##library(spatstat.geom)
library(scater)
library(scran)
theme_set(theme_classic())
As single cell technology advances, the recent development of spatial omics allows us to examine the spatial organisation of cells within tissues in their native environment. This workshop will discuss the challenges and analytical focus associated with disease outcome prediction using multi-sample spatial datasets. We will also talk about general analytical strategies and the critical thinking questions that arise in the workflow.
sessionInfo at the end of this document to ensure you are
using the correct version.Structure for this 2-hour workshop:
| Activity | Time |
|---|---|
| Introduction to spatial technologies | 20m |
| Cell segmentation with deep learning (with BIDCell) | 20m |
| Exploring spatial data | 20m |
| Break (Q&A) | 10m |
| Extracting features from spatial data (with scFeatures) | 30m |
| Performing disease outcome classification (with ClassifyR) | 20m |
Imaging mass cytometry (IMC) is a new imaging technique that gathers spatial information to create images that show the distribution of different cell types and their associated protein expression patterns in tissue. In this demo, we examine IMC dataset taken from:
Moldoveanu D et. al. Spatially mapping the immune landscape of melanoma using imaging mass cytometry. Science Immunology, Apr;7(70):eabi5072. doi: 10.1126/sciimmunol.abi5072. PMID: 35363543.
Here, the authors quantified the expression of 35 protein markers in two melanoma cohorts. Basic characteristics of the data objects:
Response = No) and 14
responders (Response = Yes).data_sce <- readRDS("data_sce.rds")
data_sce <- logNormCounts(data_sce)
print("data format")
## [1] "data format"
data_sce
## class: SingleCellExperiment
## dim: 35 112497
## metadata(0):
## assays(2): counts logcounts
## rownames(35): SMA PDL1 ... ICOS CD29
## rowData names(0):
## colnames(112497): 26BL:1 26BL:2 ... 21RD:5410 21RD:5411
## colData names(33): orig.ident nCount_RNA ... ident sizeFactor
## reducedDimNames(2): PCA UMAP
## mainExpName: RNA
## altExpNames(0):
print("expression matrix is stored in proteins by cells matrix")
## [1] "expression matrix is stored in proteins by cells matrix"
logcounts(data_sce)[1:7, 1:7]
## 7 x 7 sparse Matrix of class "dgCMatrix"
## 26BL:1 26BL:2 26BL:3 26BL:4 26BL:5 26BL:6 26BL:7
## SMA 1.4284103 1.641814 1.41206854 1.88787898 0.56130933 0.04550849 1.64393850
## PDL1 1.2555909 1.181598 1.28170915 1.54414352 0.07956439 0.44187590 1.15627466
## OX40 . . . . . . .
## CD45 . . 0.02253247 0.02522008 0.97111591 . .
## LAG3 . . . . . . .
## TIM3 0.8617437 0.854122 0.97635985 0.91544402 1.03250567 1.16430750 0.92958376
## FoxP3 . . . . . . 0.05413394
##print("the object stores meta data (such as patient outcome information) about each cell")
##DT::datatable( data.frame(colData(data_sce))[1:5, ] , options = list(scrollX = TRUE))
In this demo, we will examine one of the cohorts - pretreatment
melanoma samples from 30 individuals with advanced disease who
subsequently received ICI therapy. We will aim to identify features that
has the potnetial to discriminate between non-responders
(Response = No) and responders (Response =
Yes).
At the start of the exploration, it is often good to get a sense of the complexity of the data. We usually do this by exploring the distribution of the outcomes and various variables in the patient’s meta-data. Here, we use cross-tabulation to examine the following variables:
print("number of responder and non responder in each type of treatment ")
## [1] "number of responder and non responder in each type of treatment "
metadata <- colData(data_sce)
metadata <- metadata[ !duplicated(metadata$Sample_ID), ]
table(metadata$Response, metadata$Treatment)
##
## both CTLA4 PD1
## No 1 11 4
## Yes 4 6 4
print("Number of patients based on tissue source")
## [1] "Number of patients based on tissue source"
table(metadata$Tissue_Source)
##
## acral brain LN other met skin_nonacral
## 3 4 9 5 9
print("Number of patients based on primary site")
## [1] "Number of patients based on primary site"
table(metadata$Primary_site)
##
## 5th left toe abdomen
## 2 1 2
## back calf chest wall
## 5 1 1
## ear l arm left arm and left elbow
## 1 1 1
## left eyelid left foot sole left shoulder
## 1 1 1
## left thumb neck right upper arm
## 1 1 1
## scalp shouler thigh skin
## 1 1 2
## toe trunk unknown
## 1 1 2
## vulva
## 1
print("Cross tabulation")
## [1] "Cross tabulation"
table(metadata$Response, metadata$Tissue_Source)
##
## acral brain LN other met skin_nonacral
## No 1 3 5 3 4
## Yes 2 1 4 2 5
Typically in single-cell data analysis, we perform dimension reduction to project the high dimensional cell by gene matrix on to 2D space. This allows us to visualise various things of interest, such as distribution of cell types and disease outcomes. In this dataset, cells were classified in the following cell types based on their markers:
Note: for single-cell RNA-seq with around 20,000 genes, we often need to perform some filtering (e.g. select highly variable genes) to reduce the number of features. Here, given we have less than 50 proteins, there is no need to pre-filter. That being said, we provide some sample code below (commented out) to demonstrate how to identify highly variable genes followed by UMAP visualisation in scRNA-seq data.
# gene_var <- modelGeneVar(data_sce)
# hvgs <- getTopHVGs(gene_var, prop=0.1)
# data_sce <- runUMAP(data_sce, scale=TRUE, subset_row = hvgs)
data_sce <- runUMAP(data_sce, scale=TRUE)
a <- plotUMAP(data_sce, colour_by = "Cluster.v2")
b <- plotUMAP(data_sce, colour_by = "Response")
c <- plotUMAP(data_sce, colour_by = "Sample_ID")
a + b + c
Interactive Q&A:
What can we learn from these illustrations? Is there anything interesting in the plot? Questions to consider include:
print("Optional")
## [1] "Optional"
metadata <- colData(data_sce)
metadata <- cbind(metadata, reducedDim(data_sce, "UMAP"))
metadata <- data.frame(metadata)
plotlist <- list()
thispatient <- unique(metadata$Sample_ID)[1]
for ( thispatient in unique(metadata$Sample_ID)){
metadata$selected_patient <- ifelse( metadata$Sample_ID == thispatient, "seleted patient" , "other patients")
p <- ggplot(metadata, aes(x =UMAP1 , y = UMAP2 , colour = selected_patient )) + geom_scattermore(pointsize = 0.5) + ggtitle(thispatient) + scale_colour_manual(values = c("grey" , "red"))
plotlist [[thispatient]] <- p
}
ggarrange(plotlist = plotlist , ncol = 5, nrow = 6 , common.legend = T )
The advantage with spatial omics is that we can examine the organisation of the cell types as it occurs on the tissue slide. Here, we visualise one of the slides from a patient. As an optional exercise, you can
to give a sense of what is random ordering.
one_sample <- data_sce[, data_sce$Sample_ID == unique(data_sce$Sample_ID)[1]]
one_sample <- colData(one_sample)
one_sample <- data.frame(one_sample)
tableau_palette <- scale_colour_tableau()
color_codes <- tableau_palette$palette(10)
a <- ggplot(one_sample, aes(x = Location_Center_X , y = Location_Center_Y, colour =Cluster.v2)) + geom_point(alpha=0.7) + scale_colour_manual(values = c(color_codes, "lightgrey")) + ggtitle("Original slide")
print("Optional: Permute the cell type labels")
## [1] "Optional: Permute the cell type labels"
one_sample$Cluster.v2_permute <- sample(one_sample$Cluster.v2)
b <- ggplot(one_sample, aes(x = Location_Center_X , y = Location_Center_Y, colour =Cluster.v2_permute)) + geom_point(alpha=0.7) + scale_colour_manual(values = c(color_codes, "lightgrey")) + ggtitle("Permute the cell type label")
print("Optional: Permute the spatial coordinate")
## [1] "Optional: Permute the spatial coordinate"
one_sample$Location_Center_X_permute <- sample(one_sample$Location_Center_X)
one_sample$Location_Center_Y_permute <- sample(one_sample$Location_Center_Y)
c <- ggplot(one_sample, aes(x = Location_Center_X_permute , y = Location_Center_Y_permute, colour =Cluster.v2)) + geom_point(alpha=0.7) + scale_colour_manual(values = c(color_codes, "lightgrey")) + ggtitle("Permute the X, Y coordinate")
a + b + c
Critical thinking:
We begin by examining how can we identify and visualise regions of tissue where spatial associations between cell-types are similar. There are many packages that perform this task andhere we use the lisaClust function that is based on “local L-function” to spatially cluster cells into different regions with similar cell type composition.
set.seed(51773)
BPPARAM <- simpleSeg:::generateBPParam(2)
# Cluster cells into spatial regions with similar composition.
data_sce <- lisaClust(
data_sce ,
k = 5,
Rs = c(20, 50, 100),
sigma = 50,
spatialCoords = c("Location_Center_X", "Location_Center_Y"),
cellType = "Cluster.v2",
imageID = "Sample_ID" ,
regionName = "region",
BPPARAM = BPPARAM
)
metadata <- colData(data_sce)
metadata <- metadata[ metadata$Sample_ID == metadata$Sample_ID[1], ]
metadata <- data.frame(metadata)
plotlist <- list()
plotlist_celltype <- list()
thisregion <- unique(metadata$region)[1]
tableau_palette <- scale_colour_tableau()
color_codes <- tableau_palette$palette( 10 )
color_codes <- c(color_codes, "darkgrey" , "grey90")
names(color_codes) <- c( unique(metadata$Cluster.v2) , "other regions")
for ( thisregion in sort(unique(metadata$region))){
selected_region_index <- metadata$region == thisregion
metadata$selected_region <- "other regions"
metadata$selected_region[selected_region_index] <- "selected region"
metadata$celltype <- metadata$Cluster.v2
metadata$celltype[!selected_region_index ] <- "other regions"
metadata$celltype <- factor(metadata$celltype, levels = c(unique(metadata$Cluster.v2), "other regions"))
p <- ggplot(metadata, aes(x = Location_Center_X , y = Location_Center_Y , colour = selected_region )) + geom_scattermore(pointsize = 1.5) + ggtitle(thisregion) + scale_colour_manual(values = c("grey" , "red"))
p2 <- ggplot(metadata, aes(x = Location_Center_X , y = Location_Center_Y , colour = celltype )) + geom_scattermore(pointsize = 1.5) + ggtitle(thisregion) + scale_colour_manual(values = color_codes)
plotlist [[thisregion ]] <- p
plotlist_celltype [[thisregion ]] <- p2
}
ggarrange(plotlist = plotlist , ncol = 5, nrow = 1 , common.legend = T )
ggarrange(plotlist = plotlist_celltype , ncol = 5, nrow = 1 , common.legend = T )
We can better interpret the region output by summarising the proportion of each cell type in a region across the patients. Looking at the composition of cell types in each region, comparing between responder and non-responders.
df <- data.frame(colData( data_sce))
df_plot <- NULL
for ( thispatient in unique(df$Sample_ID)){
this_df <- df[df$Sample_ID == thispatient, ]
temp_df <- table( this_df$region, this_df$Cluster.v2 )
temp_df <- temp_df / rowSums(temp_df)
temp_df <- data.frame( temp_df)
temp_df$patient <- thispatient
temp_df$reponse <- unique( this_df$Response )
df_plot <- rbind(df_plot, temp_df)
}
df_plot <- df_plot %>% dplyr::group_by( Var1 , Var2, reponse) %>%
summarise(mean_proportion = mean(Freq))
# r1 <- df_plot[ df_plot$Var1 == "region_1", ]
ggplot(df_plot, aes(y = Var2, x = Var1 ,colour =mean_proportion , size = mean_proportion ))+ geom_point() +
facet_grid(~reponse, scales = "free", space = "free" ) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
xlab("Region" ) + ylab("Celltype") + scale_colour_viridis_c()
Interactive Q&A:
Q4: Which regions appear to be different between responders and non-responders?
df <- data.frame(colData( data_sce))
df <- df %>% dplyr::group_by(Sample_ID ,Response, region) %>%
count(Cluster.v2) %>%
mutate(proportion = n / sum(n))
ggplot(df, aes(y = proportion, x = Sample_ID , fill = Cluster.v2))+ geom_col()+facet_grid(~region+Response, scales = "free", space = "free" ) + scale_fill_manual(values = c(color_codes)) +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
Interactive Q&A:
Q5: Does your conclusion change after looking at a different plot?
We see that region 1 appears to suggest the non-responder patients have more melano. Region 3 appears to be the tumor microenvironment with lots of Th.ae (antigen-experienced) and macro.mono (macrophage and monocytes) cell types. Let’s focus on region 1 and region 3 and visualise the data with boxplots, as well as comparing to the overall cell type proportion without segmenting into regions.
df <- data.frame(colData( data_sce))
df_plot <- NULL
for ( thispatient in unique(df$Sample_ID)){
this_df <- df[df$Sample_ID == thispatient, ]
temp_df <- table( this_df$region, this_df$Cluster.v2 )
temp_df <- temp_df / rowSums(temp_df)
temp_df <- data.frame( temp_df)
temp_df$patient <- thispatient
temp_df$reponse <- unique( this_df$Response )
df_plot <- rbind(df_plot, temp_df)
}
df_plot_region_1 <- df_plot[df_plot$Var1 == "region_1", ]
a <- ggplot(df_plot_region_1, aes(x = Var2, y = Freq, colour = reponse)) +
geom_boxplot()+
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
ylab("Proportion") + xlab("Cell type")+ ggtitle("Region 1") + ylim(0,1)
df_plot_region_3 <- df_plot[df_plot$Var1 == "region_3", ]
b <- ggplot(df_plot_region_3, aes(x = Var2, y = Freq, colour = reponse)) +
geom_boxplot()+
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
ylab("Proportion") + xlab("Cell type") + ggtitle("Region 3")
overall <- NULL
for ( thispatient in unique(df$Sample_ID)){
this_df <- df[df$Sample_ID == thispatient, ]
temp_df <- table( this_df$Cluster.v2 )
temp_df <- temp_df /sum(temp_df)
temp_df <- data.frame( temp_df)
temp_df$patient <- thispatient
temp_df$reponse <- unique( this_df$Response )
overall <- rbind(overall, temp_df)
}
c <- ggplot(overall, aes(x = Var1, y = Freq, colour = reponse)) +
geom_boxplot()+
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
ylab("Proportion") + xlab("Cell type") + ggtitle("Overall composition")
a + b + c
Often you may have a marker in mind to further examine the expression of key marker genes in these region specific cell types. For example, we select cells that have high Ki67 expression. (ie, only keeping the cells that have Ki67 expression higher than the median Ki67 expression in the whole dataset). We choose Ki67 as an example here because Ki67 is strongly associated with tumor cell proliferation and growth and is widely used as a biomarker in cancer analysis.
median_ki67 <- median( logcounts(data_sce[ "Ki67" , ]))
data_sce$ki67 <- ifelse( logcounts(data_sce[ "Ki67" , ]) > median_ki67, "high_ki67" , "low_ki67")
df_plot <- NULL
for ( thispatient in unique(df$Sample_ID)){
this_df <- df[df$Sample_ID == thispatient, ]
temp_df <- table( this_df$region, this_df$Cluster.v2 )
temp_df <- temp_df / rowSums(temp_df)
temp_df <- data.frame( temp_df)
temp_df$patient <- thispatient
temp_df$reponse <- unique( this_df$Response )
df_plot <- rbind(df_plot, temp_df)
}
df_plot_region_1 <- df_plot[df_plot$Var1 == "region_1", ]
a <- ggplot(df_plot_region_1, aes(x = Var2, y = Freq, colour = reponse)) +
geom_boxplot()+
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
ylab("Proportion") + xlab("Cell type")+ ggtitle("Region 1") + ylim(0,1)
df_plot_region_3 <- df_plot[df_plot$Var1 == "region_3", ]
b <- ggplot(df_plot_region_3, aes(x = Var2, y = Freq, colour = reponse)) +
geom_boxplot()+
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
ylab("Proportion") + xlab("Cell type") + ggtitle("Region 3")
overall <- NULL
for ( thispatient in unique(df$Sample_ID)){
this_df <- df[df$Sample_ID == thispatient, ]
temp_df <- table( this_df$Cluster.v2 )
temp_df <- temp_df /sum(temp_df)
temp_df <- data.frame( temp_df)
temp_df$patient <- thispatient
temp_df$reponse <- unique( this_df$Response )
overall <- rbind(overall, temp_df)
}
c <- ggplot(overall, aes(x = Var1, y = Freq, colour = reponse)) +
geom_boxplot()+
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
ylab("Proportion") + xlab("Cell type") + ggtitle("Overall composition")
a + b + c
Discussion:
Comparing the overall composition and the cell type composition in the region, what did you observe about the regions?
In this demo, we use scFeatures to generate molecular representation for each patient from the matrix of proteins by cells. The molecular representation is interpretable and hence facilitates downstream analysis of the patient. Overall, scFeatures generates features across six categories representing different molecular views of cellular characteristics. These include:
The different types of features constructed enable a more comprehensive multi-view understanding of each patient from a matrix of proteins x cells.
Given in the previous section, we clustered the cells into regions, we can use the region information as an additional layer of information on top of the cell types to examine region-specific cell-type specific features.
region <- data_sce$region
region <- gsub("_" , "", region)
data_sce$celltype <- paste0( data_sce$Cluster.v2 , "-" , region)
print("number of cells in each sample")
## [1] "number of cells in each sample"
table(data_sce$Sample_ID)
##
## 01RD 02RD 04RD 05RD 06RD 08BL 09RD 10RD 12RD 13RD 14RD 16BL 19BL 21RD 22RD 23RD
## 2539 2556 3448 4460 4176 1924 5605 4537 2796 4661 2105 2408 2626 5411 4742 2611
## 24RD 25RD 26BL 29RD 31RD 32RD 33RD 34RD 35RD 37RD 39RD 40RD 41BL 42RD
## 4059 5774 3968 2230 4522 5366 4061 2478 4325 4032 2391 3730 4848 4108
print("number of cells in each celltype - Region specific cell type")
## [1] "number of cells in each celltype - Region specific cell type"
table(data_sce$celltype)
##
## B-region1 B-region2 B-region3
## 87 2942 649
## B-region4 B-region5 CD31-region1
## 68 148 164
## CD31-region2 CD31-region3 CD31-region4
## 932 205 371
## CD31-region5 macro.mono-region1 macro.mono-region2
## 1320 752 6395
## macro.mono-region3 macro.mono-region4 macro.mono-region5
## 668 1385 3999
## melano-region1 melano-region2 melano-region3
## 6754 37987 493
## melano-region4 melano-region5 other-region1
## 2549 3232 1278
## other-region2 other-region3 other-region4
## 14848 338 7605
## other-region5 Tc.ae-region1 Tc.ae-region2
## 2954 295 1486
## Tc.ae-region3 Tc.ae-region4 Tc.ae-region5
## 766 347 1694
## Tc.naive-region1 Tc.naive-region2 Tc.naive-region3
## 101 190 146
## Tc.naive-region4 Tc.naive-region5 Th.ae-region1
## 149 475 165
## Th.ae-region2 Th.ae-region3 Th.ae-region4
## 659 1890 148
## Th.ae-region5 Th.naive-region1 Th.naive-region2
## 651 42 98
## Th.naive-region3 Th.naive-region4 Th.naive-region5
## 622 45 220
## Treg-region1 Treg-region2 Treg-region3
## 50 195 427
## Treg-region4 Treg-region5 unclassified-region1
## 48 236 187
## unclassified-region2 unclassified-region3 unclassified-region4
## 1938 187 465
## unclassified-region5
## 452
Discussion:
Are there any samples or cell types you would like to remove from the data?
All the feature types can be generated in one line of code. This runs
the function using default settings for all parameters. For more
information, type ?scFeatures.
# scFeatures requires the following columns
# celltype, sample, x_cord and y_cord
# alternatively, these can be also set as argument in the scFeatures function
data_sce$sample <- data_sce$Sample_ID
data_sce$x_cord <- data_sce$Location_Center_X
data_sce$y_cord <- data_sce$Location_Center_Y
# here, we specify that this is a spatial proteomics data
# scFeatures support parallel computation to speed up the process
scfeatures_result <- scFeatures(data_sce , type = "spatial_p" , ncores = 10 )
We have generated a total of 13 feature types and stored them in a
list. All generated feature types are stored in a matrix of samples by
features. For example, the first list element contains the feature type
proportion_raw, which contains the cell type proportion
features for each patient sample. We could print out the first 5 columns
and first 5 rows of the first element to see.
scfeatures_result <- readRDS("scfeatures_result_region_specific.rds")
# combine sample name with outcome
scfeatures_result_format <- scfeatures_result
outcome <- data_sce[, match( rownames(scfeatures_result[[1]]), data_sce$Sample_ID)]$Response
outcome <- unname(outcome)
for ( i in c(1:length(scfeatures_result_format))){
this <- scfeatures_result_format[[i]]
rownames(this) <- paste0(rownames(this), "_cond_", outcome )
scfeatures_result_format[[i]] <- this
}
# we have generated a total of 13 feature types
names(scfeatures_result_format)
## [1] "proportion_raw" "proportion_logit" "proportion_ratio"
## [4] "gene_mean_celltype" "gene_prop_celltype" "gene_cor_celltype"
## [7] "gene_mean_bulk" "gene_prop_bulk" "gene_cor_bulk"
## [10] "L_stats" "celltype_interaction" "morans_I"
## [13] "nn_correlation"
# each row is a sample, each column is a feature
data.frame(scfeatures_result_format[[1]][1:5, 1:5])
## B.region1 B.region2 B.region3 B.region4 B.region5
## 26BL_cond_No 0.000000000 0.000000000 0.00000000 0.0000000000 0.0005040323
## 19BL_cond_No 0.000000000 0.000000000 0.00000000 0.0000000000 0.0000000000
## 08BL_cond_No 0.000000000 0.001039501 0.00000000 0.0005197505 0.0015592516
## 41BL_cond_No 0.000000000 0.000000000 0.00000000 0.0000000000 0.0000000000
## 31RD_cond_No 0.008624502 0.030075188 0.03803627 0.0006634233 0.0055285272
## DT::datatable(meta_table , options = list(scrollX = TRUE))
Once the features are generated, you may wish to visually explore the features.
Here, we plot a volcano plot and a dotplot for the region specific cell type specific expression feature.
gene_mean_celltype <- scfeatures_result_format$gene_mean_celltype
# this transposes the data
# in bioinformatics conversion, features are stored in rows
# in statistics convention, features are stored in columns
gene_mean_celltype <- t(gene_mean_celltype)
# pick CD31-region5 as an example cell type
gene_mean_celltype <- gene_mean_celltype[ grep("B-region4", rownames(gene_mean_celltype)), ]
condition <- unlist( lapply( strsplit( colnames(gene_mean_celltype), "_cond_"), `[`, 2))
condition <- data.frame(condition = condition )
design <- model.matrix(~condition, data = condition)
fit <- lmFit(gene_mean_celltype, design)
fit <- eBayes(fit)
tT <- topTable(fit, n = Inf)
tT$gene <- rownames(tT)
p <- ggplot( tT , aes(logFC,-log10(P.Value) , text = gene ) )+
geom_point(aes(colour=-log10(P.Value)), alpha=1/3, size=1) +
scale_colour_gradient(low="blue",high="red")+
xlab("log2 fold change") + ylab("-log10 p-value") + theme_minimal()
a <- ggplotly(p)
a
b <- ggplot( tT , aes( y = reorder(gene, logFC) , x = logFC ) )+
geom_point(aes(colour=-log10(P.Value)), alpha=1/3, size=1) +
scale_colour_gradient(low="blue",high="red")+
xlab("logFC") + ylab("region specific cel type specfic features" ) + theme_minimal()
b
Interactive Q&A:
Q6: Which figure do you prefer? The volcano plot or the dotplot?
To accommodate for easier interpretation of the features, scFeatures
contains a function run_association_study_report that
enables the user to readily visualise and explore all generated features
with one line of code.
Tips: Some categories of features such as cell cell interaction takes a long time to run. You may use parallel computation to speed up the process or select a small number of feature categories to reduce computational time.
# specify a folder to store the html report. Here we store it in the current working directory.
output_folder <- getwd()
run_association_study_report(scfeatures_result, output_folder )
Interactive Q&A:
Using the HTML, we can look at some of the critical thinking questions that a researcher would ask about the generated features. These questions are exploratory and there is no right or wrong answer.
Q7: Do the generated features look reasonable?
In this section, we will perform disease outcome classification using the molecular representation of patients. Recall in the previous section that we have stored the 13 feature types matrix in a list. Instead of manually retrieving each matrix from the list to build separate models, classifyR can directly take a list of matrices as an input and run a repeated cross-validation model on each matrix individually. Below, we run 5 repeats of 3-fold cross-validation.
outcome <- data_sce[, match( rownames(scfeatures_result[[1]]), data_sce$Sample_ID)]$Response
outcome <- unname(outcome)
### generate classfyr result
classifyr_result <- crossValidate(scfeatures_result,
outcome,
classifier = "kNN",
nFolds = 3,
nRepeats = 5,
nCores = 20 )
To examine the classification model performance, we first need to specify a metric to calculate. Here, we calculate the balanced accuracy and visualise the accuracy using boxplots.
classifyr_result <- readRDS("classifyr_result_region_specific.rds")
classifyr_result <- lapply(classifyr_result,
function(x) calcCVperformance(x, performanceType = "Balanced Accuracy"))
level_order <- names(scfeatures_result)
p <- performancePlot(classifyr_result) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
scale_x_discrete(limits = level_order)
p
Interactive Q&A:
Q8: Based on the classification performance and survival (see Appendix below), which feature type would you like to focus on at your next stage of analysis?
The dataset has a survival outcome. Apart from performing prediction on responder versus non-responder, here we highlight the use of scFeatures for survival analysis.
survival_day <- unname( data_sce[, match( rownames(scfeatures_result[[1]]), data_sce$Sample_ID)]$Survival_from_Rx_Start)
censoring <- unname( data_sce[, match( rownames(scfeatures_result[[1]]), data_sce$Sample_ID)]$Known_Deceased)
i <- 13
plotlist <- list()
for (i in c(1 : length( scfeatures_result ))){
feature_name <- names(scfeatures_result)[i]
feature <- scfeatures_result[[i]]
feature <- t(feature)
# run hierarchical clustering
hclust_res <- hclust(
as.dist(1 - cor(feature )),
method = "ward.D2"
)
cluster_res <- cutree(hclust_res, k = 2)
metadata <- data.frame( cluster = factor(cluster_res),
survival_day = survival_day,
censoring = censoring)
# plot survival curve
fit <- survfit(
Surv(survival_day, censoring) ~ cluster,
data = metadata
)
ggsurv <- ggsurvplot(fit,
conf.int = FALSE, risk.table = TRUE,
risk.table.col = "strata", pval = TRUE,
xlim = c(0,700), break.time.by = 100
) + ggtitle( feature_name)
plotlist[[feature_name]] <- ggsurv
}
arrange_ggsurvplots( plotlist, print = TRUE,
ncol = 3 , risk.table.height = 0.3)
The L function is a spatial statistic used to assess the spatial distribution of cell types. It assesses the significance of cell-cell interactions, by calculating the density of a cell type with other cell types within a certain radius. High values indicate spatial association, low values indicate spatial avoidance.
tableau_palette <- scale_colour_tableau()
color_codes <- tableau_palette$palette( 10 )
color_codes <- c(color_codes, "darkgrey" , "grey90")
names(color_codes) <- c( unique(data_sce$Cluster.v2) , "other regions")
one_sample <- data_sce[ , data_sce$Sample_ID == "16BL" ]
one_sample <- data.frame( colData(one_sample) )
one_sample$celltype <- one_sample$Cluster.v2
index <- one_sample$celltype %in% c("macro.mono", "Tc.ae")
one_sample$celltype[!index] <- "others"
a <-ggplot( one_sample, aes(x = Location_Center_X , y = Location_Center_Y, colour = celltype )) + geom_point() + scale_colour_manual(values = color_codes) + ggtitle( "Patient 16BL - high L value with \n macro.mono interacting Tc.ae")
one_sample$celltype <- one_sample$Cluster.v2
index <- one_sample$celltype %in% c("melano", "Tc.ae")
one_sample$celltype[!index] <- "others"
b <- ggplot( one_sample, aes(x = Location_Center_X , y = Location_Center_Y, colour = celltype )) + geom_point() + scale_colour_manual(values = color_codes) + ggtitle( "Patient 16BL - low L value with \n melano interacting Tc.ae")
a + b
We calculate the nearest neighbours of each cell and then calculate the pairs of cell types based on the nearest neighbour. This allows us to summarise it into a cell type interaction composition.
tableau_palette <- scale_colour_tableau()
color_codes <- tableau_palette$palette( 10 )
color_codes <- c(color_codes, "darkgrey" , "grey90")
names(color_codes) <- c( unique(data_sce$Cluster.v2) , "other regions")
one_sample <- data_sce[ , data_sce$Sample_ID == "16BL" ]
one_sample <- data.frame( colData(one_sample) )
one_sample$celltype <- one_sample$Cluster.v2
a <-ggplot( one_sample, aes(x = Location_Center_X , y = Location_Center_Y, colour = celltype )) + geom_point() + scale_colour_manual(values = color_codes) + ggtitle( "Patient 16BL")
feature <- scfeatures_result$celltype_interaction
to_plot <- data.frame( t( feature["16BL", ]) )
to_plot$feature <- rownames(to_plot)
colnames(to_plot)[2] <- "celltype interaction composition"
to_plot <- to_plot[ to_plot$X16BL > 0.03 , ]
b <- ggplot(to_plot, aes( x = reorder(`celltype interaction composition`, X16BL) , y = X16BL, fill=`celltype interaction composition`)) + geom_bar(stat="identity" ) + ylab("Major cell type interactions") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
a+ b
Moran’s I is a spatial autocorrelation statistic based on both location and values. It quantifies whether similar values tend to occur near each other or are dispersed.
high <- data_sce["Ki67", data_sce$Sample_ID == "25RD" ]
high_meta <- data.frame( colData(high) )
high_meta$expression <- as.vector(logcounts( high))
low <- data_sce["Ki67", data_sce$Sample_ID == "42RD" ]
low_meta <- data.frame( colData(low) )
low_meta$expression <- as.vector(logcounts(low))
a <- ggplot(high_meta, aes(x = Location_Center_X , y = Location_Center_Y, colour =expression)) + geom_point(alpha=0.5) + scale_colour_viridis_c() + ggtitle("Patient 25RD - high Moran's I in Ki67")
b <- ggplot(low_meta, aes(x = Location_Center_X , y = Location_Center_Y, colour =expression)) + geom_point(alpha=0.5) + scale_colour_viridis_c() + ggtitle("Patient 42RD - low Moran's I in Ki67")
a+b
This metric measures the correlation of proteins/genes between a cell and its nearest neighbour cell.
We see from both prediction and survival analysis that the feature type “nn correlation” (nearest neighbour correlation) performs the best.
Here we pick the protein “S100”, a key player in cancer, as an example to illustrate the concept.
plot_nncorrelation <- function(thissample , thisprotein){
sample_name <- thissample
thissample <- data_sce[, data_sce$Sample_ID == sample_name]
exprsMat <- logcounts(thissample)
cell_points_cts <- spatstat.geom::ppp(
x = as.numeric(thissample$Location_Center_X ), y = as.numeric(thissample$Location_Center_Y),
check = FALSE,
xrange = c(
min(as.numeric(thissample$Location_Center_X)),
max(as.numeric(thissample$Location_Center_X))
),
yrange = c(
min(as.numeric(thissample$Location_Center_Y)),
max(as.numeric(thissample$Location_Center_Y))
),
marks = t(as.matrix(exprsMat))
)
value <- spatstat.explore::nncorr(cell_points_cts)["correlation", ]
value <- value[ thisprotein]
# Find the indices of the two nearest neighbors for each cell
nn_indices <- nnwhich(cell_points_cts, k = 1)
protein <- thisprotein
df <- data.frame(thiscell_exprs = exprsMat[protein, ] , exprs = exprsMat[protein,nn_indices ])
p <- ggplot(df, aes( x =thiscell_exprs , y = exprs , colour = exprs )) +
geom_point(alpha = 0.3) + ggtitle(paste0( "Patient ", sample_name , " nn_corr = " , round(value, 2) )) + scale_colour_viridis_c()
return (p )
}
p1 <- plot_nncorrelation( "42RD" , "S100" )
p2 <- plot_nncorrelation( "29RD" , "S100" )
p1 + p2
The correlation differs between the 42RD patient (from cluster 1) and the 29RD patient (from cluster 2).
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 12 (bookworm)
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.21.so; LAPACK version 3.11.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Australia/Sydney
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] scran_1.28.1 scater_1.28.0
## [3] scuttle_1.10.1 lisaClust_1.8.0
## [5] spatstat_3.0-6 spatstat.linnet_3.1-1
## [7] spatstat.model_3.2-4 rpart_4.1.19
## [9] spatstat.explore_3.2-1 nlme_3.1-162
## [11] spatstat.random_3.1-5 spatstat.geom_3.2-1
## [13] spatstat.data_3.0-1 survminer_0.4.9
## [15] ggpubr_0.6.0 tidyr_1.3.0
## [17] scattermore_0.8 plotly_4.10.2
## [19] limma_3.56.2 dplyr_1.1.2
## [21] spicyR_1.12.0 ggthemes_4.2.4
## [23] ClassifyR_3.4.8 survival_3.5-5
## [25] BiocParallel_1.34.2 MultiAssayExperiment_1.26.0
## [27] generics_0.1.3 scFeatures_0.99.27
## [29] ggplot2_3.4.2 SingleCellExperiment_1.22.0
## [31] SummarizedExperiment_1.30.2 Biobase_2.60.0
## [33] GenomicRanges_1.52.0 GenomeInfoDb_1.36.1
## [35] IRanges_2.34.1 S4Vectors_0.38.1
## [37] BiocGenerics_0.46.0 MatrixGenerics_1.12.2
## [39] matrixStats_1.0.0
##
## loaded via a namespace (and not attached):
## [1] SpatialExperiment_1.10.0 R.methodsS3_1.8.2
## [3] GSEABase_1.62.0 progress_1.2.2
## [5] tiff_0.1-11 EnsDb.Mmusculus.v79_2.99.0
## [7] goftest_1.2-3 Biostrings_2.68.1
## [9] HDF5Array_1.28.1 vctrs_0.6.1
## [11] digest_0.6.32 png_0.1-8
## [13] shape_1.4.6 EnsDb.Hsapiens.v79_2.99.0
## [15] ggrepel_0.9.3 deldir_1.0-9
## [17] parallelly_1.36.0 magick_2.7.4
## [19] MASS_7.3-59 reshape2_1.4.4
## [21] httpuv_1.6.11 foreach_1.5.2
## [23] withr_2.5.0 xfun_0.39
## [25] ellipsis_0.3.2 commonmark_1.9.0
## [27] cytomapper_1.12.0 memoise_2.0.1
## [29] proxyC_0.3.3 ggbeeswarm_0.7.2
## [31] systemfonts_1.0.4 zoo_1.8-12
## [33] GlobalOptions_0.1.2 gtools_3.9.4
## [35] SingleCellSignalR_1.12.0 pbapply_1.7-2
## [37] R.oo_1.25.0 prettyunits_1.1.1
## [39] KEGGREST_1.40.0 promises_1.2.0.1
## [41] httr_1.4.6 rstatix_0.7.2
## [43] restfulr_0.0.15 globals_0.16.2
## [45] fitdistrplus_1.1-11 rhdf5filters_1.12.1
## [47] rhdf5_2.44.0 rstudioapi_0.14
## [49] miniUI_0.1.1.1 concaveman_1.1.0
## [51] babelgene_22.9 curl_5.0.1
## [53] zlibbioc_1.46.0 ScaledMatrix_1.8.1
## [55] polyclip_1.10-4 GenomeInfoDbData_1.2.10
## [57] fftwtools_0.9-11 xtable_1.8-4
## [59] stringr_1.5.0 evaluate_0.21
## [61] S4Arrays_1.0.4 BiocFileCache_2.8.0
## [63] hms_1.1.3 irlba_2.3.5.1
## [65] colorspace_2.1-0 filelock_1.0.2
## [67] ROCR_1.0-11 reticulate_1.30
## [69] magrittr_2.0.3 lmtest_0.9-40
## [71] viridis_0.6.3 later_1.3.1
## [73] lattice_0.21-8 future.apply_1.11.0
## [75] XML_3.99-0.14 cowplot_1.1.1
## [77] RcppAnnoy_0.0.21 svgPanZoom_0.3.4
## [79] class_7.3-22 pillar_1.9.0
## [81] simpleSeg_1.2.0 EBImage_4.42.0
## [83] iterators_1.0.14 caTools_1.18.2
## [85] compiler_4.3.1 beachmat_2.16.0
## [87] stringi_1.7.12 tensor_1.5
## [89] minqa_1.2.5 GenomicAlignments_1.36.0
## [91] plyr_1.8.8 msigdbr_7.5.1
## [93] crayon_1.5.2 abind_1.4-5
## [95] BiocIO_1.10.0 ggtext_0.1.2
## [97] locfit_1.5-9.8 sp_2.0-0
## [99] terra_1.7-39 bit_4.0.5
## [101] codetools_0.2-19 BiocSingular_1.16.0
## [103] crosstalk_1.2.0 bslib_0.5.0
## [105] multtest_2.56.0 mime_0.12
## [107] splines_4.3.1 markdown_1.7
## [109] circlize_0.4.15 Rcpp_1.0.10
## [111] dbplyr_2.3.2 sparseMatrixStats_1.12.2
## [113] gridtext_0.1.5 knitr_1.43
## [115] blob_1.2.4 utf8_1.2.3
## [117] AnnotationFilter_1.24.0 lme4_1.1-34
## [119] nnls_1.4 listenv_0.9.0
## [121] DelayedMatrixStats_1.22.1 GSVA_1.48.2
## [123] ggsignif_0.6.4 tibble_3.2.1
## [125] Matrix_1.5-4.1 scam_1.2-14
## [127] statmod_1.5.0 svglite_2.1.1
## [129] tweenr_2.0.2 pkgconfig_2.0.3
## [131] pheatmap_1.0.12 tools_4.3.1
## [133] cachem_1.0.8 RSQLite_2.3.1
## [135] viridisLite_0.4.2 DBI_1.1.3
## [137] numDeriv_2016.8-1.1 fastmap_1.1.1
## [139] rmarkdown_2.23 scales_1.2.1
## [141] grid_4.3.1 ica_1.0-3
## [143] Seurat_4.3.0.1 shinydashboard_0.7.2
## [145] Rsamtools_2.16.0 broom_1.0.5
## [147] sass_0.4.6 patchwork_1.1.2
## [149] BiocManager_1.30.21 graph_1.78.0
## [151] carData_3.0-5 RANN_2.6.1
## [153] farver_2.1.1 mgcv_1.8-42
## [155] yaml_2.3.7 rtracklayer_1.60.0
## [157] cli_3.6.1 purrr_1.0.1
## [159] leiden_0.4.3 lifecycle_1.0.3
## [161] uwot_0.1.16 bluster_1.10.0
## [163] backports_1.4.1 DropletUtils_1.20.0
## [165] annotate_1.78.0 gtable_0.3.3
## [167] rjson_0.2.21 ggridges_0.5.4
## [169] progressr_0.13.0 parallel_4.3.1
## [171] ape_5.7-1 jsonlite_1.8.7
## [173] edgeR_3.42.4 bitops_1.0-7
## [175] bit64_4.0.5 Rtsne_0.16
## [177] spatstat.utils_3.0-3 BiocNeighbors_1.18.0
## [179] SeuratObject_4.1.3 RcppParallel_5.1.7
## [181] highr_0.10 jquerylib_0.1.4
## [183] metapod_1.8.0 dqrng_0.3.0
## [185] survMisc_0.5.6 R.utils_2.12.2
## [187] lazyeval_0.2.2 shiny_1.7.4
## [189] htmltools_0.5.5 KMsurv_0.1-5
## [191] sctransform_0.3.5 rappdirs_0.3.3
## [193] ensembldb_2.24.0 glue_1.6.2
## [195] XVector_0.40.0 RCurl_1.98-1.12
## [197] jpeg_0.1-10 gridExtra_2.3
## [199] AUCell_1.22.0 boot_1.3-28.1
## [201] igraph_1.5.0 R6_2.5.1
## [203] gplots_3.1.3 labeling_0.4.2
## [205] km.ci_0.5-6 GenomicFeatures_1.52.1
## [207] cluster_2.1.4 Rhdf5lib_1.22.0
## [209] nloptr_2.0.3 vipor_0.4.5
## [211] DelayedArray_0.26.6 tidyselect_1.2.0
## [213] ProtGenerics_1.32.0 raster_3.6-23
## [215] ggforce_0.4.1 xml2_1.3.4
## [217] car_3.1-2 AnnotationDbi_1.62.2
## [219] future_1.33.0 rsvd_1.0.5
## [221] munsell_0.5.0 KernSmooth_2.23-21
## [223] data.table_1.14.8 htmlwidgets_1.6.2
## [225] RColorBrewer_1.1-3 biomaRt_2.56.1
## [227] rlang_1.1.1 spatstat.sparse_3.0-2
## [229] lmerTest_3.1-3 fansi_1.0.4
## [231] beeswarm_0.4.0
The authors thank all their colleagues, particularly at The University of Sydney, Sydney Precision Data Science and Charles Perkins Centre for their support and intellectual engagement. Special thanks to Ellis Patrick, Shila Ghazanfar, Andy Tran for guiding and supporting the building of this workshop.